(57) Abstract: A video processing system tracks an object of interest using a hybrid combination of (i) optical zooming by a
pan-tilt-zoom (PTZ) camera, and (ii) virtual zooming on an image generated by that camera. The object of interest (22-k) is initially
detected in an image (40) generated by the camera (18). An optical zooming operation (34) then adjusts pan and tilt settings to
frame the object of interest (22-k), and zooms in on the object of interest (22-k) until one or more designated stopping criteria are
met. A virtual zooming operation (36) processes the resulting optically-zoomed image (44) to identify and extract a particular region
of interest (47), and then interpolates the extracted region of interest to generate a virtually-zoomed image (46). The designated
stopping criteria may indicate, e.g., that the optical zooming continues until the object of interest (22-k) occupies a fixed or dynamic
percentage of the resulting optically-zoomed image.

Real-time tracking of an object of interest using a hybrid optical and virtual zooming
mechanism.



The present invention relates generally to the field of video signal processing, 
and more particularly to techniques for tracking persons or other objects of interest using a 
video camera such that a desired video output can be achieved.

Tracking a person or other object of interest is an important aspect of
video-camera-based systems such as video conferencing systems and video surveillance systems.
For example, in a video conferencing system, it is often desirable to frame the head and
shoulders of a particular conference participant in the resultant output video signal, while in a
video surveillance system, it may be desirable to frame the entire body of, e.g., a person
entering or leaving a restricted area monitored by the system.

Such systems generally utilize one of two distinct approaches to implement
tracking of an object of interest. The first approach uses a pan-tilt-zoom (PTZ) camera that
allows the system to position and optically zoom the camera to perform the tracking task. A
problem with this approach is that, in some cases, the tracking mechanism is not sufficiently
robust to sudden changes in the position of the object of interest. This may be due to the fact
that the camera is often zoomed in too far to react to the sudden changes. For example,
it is not uncommon in a video conferencing system for participants to move within their seats,
e.g., to lean forward or backward, or to one side or the other. If a PTZ camera is zoomed in
too far on a particular participant, a relatively small movement of the participant may cause
the PTZ camera to lose track of that participant, necessitating zoom-out and re-track
operations that will be distracting to a viewer of the resultant output video signal.

The second approach is referred to as "virtual zoom" or "electronic zoom."
In this approach, video information from one or more cameras is processed electronically such
that the object of interest remains visible in a desired configuration in the output video signal
despite the fact that the object may not be centered in the field of view of any particular
camera. U.S. Patent No. 5,187,574 discloses an example of such an approach, in which an
image of an arriving guest is picked up by a fixed television camera of a surveillance system.
The image is processed using detection, extraction and interpolation operations to ensure that
the head of the guest is always displayed at the center of the monitor screen. This approach
ensures that the video output has a desired form, e.g., is centered on an object of interest,
without the need for pan, tilt or zoom operations. As a result, this approach can operate with
fixed cameras, which are generally significantly less expensive than the above-noted PTZ
cameras. However, this approach fails to provide the output image quality required in many
applications. For example, the extraction and interpolation operations associated with virtual
zooming will generally result in a decreased resolution and image quality in the resultant
output video signal, and therefore may not be suitable for video conferencing or other similar
applications.

As is apparent from the above, a need exists for an improved tracking technique
which can provide the output video signal quality and resolution associated with the PTZ
camera approach as well as the flexibility of the virtual zoom approach, while also avoiding
the problems generally associated with these approaches.

The invention provides methods and apparatus for real-time tracking of an
object of interest in a video processing system, using a hybrid combination of (i) optical
zooming by a pan-tilt-zoom (PTZ) camera, and (ii) virtual zooming on an image generated by
that camera. In an illustrative embodiment of the invention, the object of interest is initially
detected in an image generated by the camera. An optical zooming operation then adjusts pan
and tilt settings to frame the object of interest, and zooms in on the object of interest until one
or more designated stopping criteria are met. A virtual zooming operation processes the
resulting optically-zoomed image to identify and extract a particular region of interest, and
then interpolates the extracted region of interest to generate a virtually-zoomed image.

In accordance with one aspect of the invention, the designated stopping criteria
may indicate, e.g., that the optical zooming continues until the object of interest occupies a
fixed or dynamic percentage of the resulting optically-zoomed image. In the case of a
dynamic percentage, the percentage may vary as a function of a detected quality associated
with the object of interest. Examples of such detected qualities include a level of apparent
motion, a use of a particular audibly-detectable key word or other cue, and a change in
intensity, pitch or other voice quality.

In accordance with another aspect of the invention, the virtual zooming
operation may be repeated on the resulting optically-zoomed image, using the same pan, tilt
and zoom settings established in the optical zooming operation, if a level of movement of the
object of interest exceeds a first designated threshold. The optical zooming operation itself
may be repeated in order to establish new pan, tilt and zoom settings for the camera if the level
of movement of the object of interest exceeds a second designated threshold higher than the
first threshold.

The hybrid optical and virtual zoom mechanism of the invention provides a
number of significant advantages over conventional approaches. For example, the hybrid
mechanism accommodates a certain amount of movement of the object of interest without the
need to determine new optical pan, tilt and zoom settings, while also providing a desired
output image quality level. By preventing the PTZ camera from zooming in too far, the
invention ensures that the PTZ camera settings are adjusted less frequently, and the
computational load on the system processor is thereby reduced relative to that required by a
conventional optical zoom approach. In addition, the hybrid mechanism of the invention can
provide an improved compression rate for image transmission. These and other features and
advantages of the present invention will become more apparent from the accompanying
drawings and the following detailed description.

Fig. 1 is a block diagram of a video processing system in accordance with an 
illustrative embodiment of the invention. 

Fig. 2 is a functional block diagram illustrating hybrid real-time tracking video
processing operations implemented in the system of Fig. 1.

Fig. 1 shows a video processing system 10 in accordance with an illustrative
embodiment of the invention. The system 10 includes a processor 12, a memory 14, an
input/output (I/O) device 15 and a controller 16, all connected to communicate over a system
bus 17. The system 10 further includes a pan-tilt-zoom (PTZ) camera 18 which is coupled to
the controller 16 as shown. In the illustrative embodiment, the PTZ camera 18 is employed in
a video conferencing application in which a table 20 accommodates a number of conference
participants 22-1, 22-k, 22-N. In operation, the PTZ camera 18, as directed by the
controller 16 in accordance with instructions received from the processor 12, tracks an object
of interest which in this example application corresponds to a particular participant 22-k. The
PTZ camera 18 performs this real-time tracking function using a hybrid optical and virtual zooming
mechanism to be described in greater detail below in conjunction with Fig. 2.

Although the invention will be illustrated in the context of a video conferencing
application, it should be understood that the video processing system 10 can be used in a wide
variety of other applications. For example, the portion 24 of the system 10 can be used in
video surveillance applications, and in other types of video conferencing applications, e.g., in
applications involving congress-like seating arrangements, circular or rectangular table
arrangements, etc. More generally, the portion 24 of system 10 can be used in any application
which can benefit from the improved tracking function provided by a hybrid optical and
virtual zoom mechanism. The portion 26 of the system 10 may therefore be replaced with,
e.g., other video conferencing arrangements, video surveillance arrangements, or any other
arrangement of one or more objects of interest to be tracked using the portion 24 of the system
10. It will also be apparent that the invention can be used with image capture devices other
than PTZ cameras. The term "camera" as used herein is therefore intended to include any
type of image capture device which can be used in conjunction with a hybrid optical and
virtual zooming mechanism.

It should be noted that elements or groups of elements of the system 10 may
represent corresponding elements of an otherwise conventional desktop or portable computer,
as well as portions or combinations of these and other processing devices. Moreover, in other
embodiments of the invention, some or all of the functions of the processor 12, controller 16
or other elements of the system 10 may be combined into a single device. For example, one or
more of the elements of system 10 may be implemented as an application specific integrated
circuit (ASIC) or circuit card to be incorporated into a computer, television, set-top box or
other processing device. The term "processor" as used herein is intended to include a
microprocessor, central processing unit, microcontroller or any other data processing element
that may be utilized in a given data processing device. In addition, it should be noted that the
memory 14 may represent an electronic memory, an optical or magnetic disk-based memory, a
tape-based memory, as well as combinations or portions of these and other types of storage
devices.

Fig. 2 is a functional block diagram illustrating a hybrid optical and virtual
zoom mechanism 30 implemented in the system 10 of Fig. 1. Again, although illustrated in
the context of a video conferencing application, it will be apparent that the techniques
described are readily applicable to any other tracking application. As shown in Fig. 2, the
hybrid optical and virtual zoom mechanism 30 includes a detection and tracking operation 32,
an optical zooming operation 34, and a virtual zooming operation 36. These operations will
be described with reference to images 40, 42, 44 and 46 which correspond to images generated
for the exemplary video conferencing application in portion 26 of system 10. The operations
32, 34 and 36 may be implemented in system 10 by processor 12 and controller 16, utilizing
one or more software programs stored in the memory 14 or accessible via the I/O device 15
from a local or remote storage device.

In operation, PTZ camera 18 generates image 40 which includes an object of
interest, i.e., video conference participant 22-k, and an additional object, i.e., another
participant 22-k+1 adjacent to the object of interest. The image 40 is supplied as a video input
to the detection and tracking operation 32, which detects and tracks the object of interest 22-k
using well-known conventional detection and tracking techniques.

For example, in the video conferencing application, the object of interest 22-k
may correspond to the current speaker. In this case, the detection and tracking operation 32
may detect and track the object of interest 22-k using techniques such as audio location to
determine which conference participant is the current speaker, motion detection to determine
which conference participant is talking, gesturing, shaking his or her head, moving in a
particular manner, speaking in a particular manner, etc.

In a video surveillance application, the object of interest may be a person taking
a particular action, e.g., entering or leaving a restricted area or engaging in suspicious
behavior, a child moving about in a room of a home, a vehicle entering or leaving a parking
garage, etc. The output of the detection and tracking operation 32 includes information
identifying the particular object of interest 22-k, which is shown as shaded in the image 42.

The particular type of detection and tracking mechanisms used in operation 32
will generally vary depending upon the application. Conventional detection and tracking
techniques which may be used in operation 32 include those described in, e.g., C. Wren, A.
Azarbayejani, T. Darrell, A. Pentland, "Pfinder: Real-time Tracking of the Human Body,"
IEEE Trans. PAMI, 19(7):780-785, July 1997; H. Rowley, S. Baluja, T. Kanade, "Rotation
Invariant Neural Network-Based Face Detection," Proc. IEEE Conf. on Computer Vision,
pp. 38-44, June 1998; and A. Lipton, H. Fujiyoshi, R. Patil, "Moving Target Classification and
Tracking from Real-Time Video," Proc. IEEE Workshop on Application of Computer Vision,
pp. 8-14, Oct. 1998.

The optical zooming operation 34 of Fig. 2 provides a sufficient amount of
zooming to ensure that a desired output image quality can be achieved, while also allowing for
a certain amount of movement of the object of interest. The optical zooming operation 34
includes a framing portion with pan and tilt operations for framing the object of interest 22-k,
followed by a zooming portion with a zooming operation that continues until designated
stopping criteria are satisfied.

Assuming that the radial distortion of the camera lens is negligible, the
following approach can be used to estimate the required amount of pan and tilt in the framing
portion of operation 34. Suppose the object of interest 22-k is detected in operation 32 as
being located at a pixel coordinate position $(x, y)$ in image 42. The framing portion of
operation 34 adjusts the pan and tilt of camera 18 such that the object of interest appears in the
center $(c_x, c_y)$ of the image. Let $ZF$ be the current zoom factor, $\alpha_P$ be the current camera pan
angle, $\alpha_T$ be the current camera tilt angle, and $D$ be the number of degrees per pixel, as
predetermined when the camera zoom factor $ZF = 1$. The new pan angle $\alpha_P'$ and new tilt
angle $\alpha_T'$ are then given by:

$$\alpha_P' = \alpha_P + D \cdot \frac{x - c_x}{ZF},$$

$$\alpha_T' = \alpha_T + D \cdot \frac{y - c_y}{ZF}.$$

Other techniques may also be used to determine the appropriate pan and tilt adjustments for 
the framing portion of operation 34. For example, techniques for determining pan and tilt in 
the presence of radial distortion of the camera lens will be apparent to those skilled in the art. 
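
As an illustration only, the framing computation described above (new angle equals current angle plus the degrees-per-pixel constant times the pixel offset from the image center, divided by the zoom factor) can be sketched in Python. The function name frame_object, its parameter names and the sample values are hypothetical and do not appear in the original description; the sign convention relating pixel offsets to pan and tilt directions is likewise an assumption that depends on the particular camera.

```python
def frame_object(x, y, cx, cy, pan_deg, tilt_deg, zoom_factor, deg_per_pixel):
    """Return (new_pan, new_tilt) intended to bring pixel (x, y) to the center (cx, cy).

    Assumes negligible radial lens distortion. deg_per_pixel is the number of
    degrees per pixel measured at zoom factor 1 (an illustrative parameter name).
    """
    if zoom_factor <= 0:
        raise ValueError("zoom_factor must be positive")
    # The angular size of one pixel shrinks in proportion to the zoom factor.
    new_pan = pan_deg + deg_per_pixel * (x - cx) / zoom_factor
    new_tilt = tilt_deg + deg_per_pixel * (y - cy) / zoom_factor
    return new_pan, new_tilt


# Hypothetical example: 640x480 image, object detected right of and above center.
new_pan, new_tilt = frame_object(
    x=400, y=200, cx=320, cy=240,
    pan_deg=10.0, tilt_deg=5.0, zoom_factor=2.0, deg_per_pixel=0.1,
)
print(new_pan, new_tilt)
```

Dividing by the zoom factor reflects that, at a higher zoom, the same pixel offset corresponds to a smaller angular correction.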

After completion of the framing portion of operation 34, the zooming portion of 
operation 34 is commenced. As previously noted, this portion of operation 34 involves an 
optical zooming which continues until one or more designated stopping criteria are satisfied. 
There are a number of different types of stopping criteria which may be used. In a fixed 
stopping criteria approach, the optical zooming continues until the object of interest occupies a 
fixed percentage of the image. For example, in a video conferencing system, the optical 
zooming may continue until the head of the current speaker occupies between about 25% and 
35% of the vertical size of the image. Of course, the specific percentages used will vary 
depending upon the tracking application. The specific percentages suitable for a particular 
application can be determined in a straightforward manner by those of ordinary skill in the art.
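
The fixed stopping criterion can be sketched as a simple loop, shown here in Python under stated assumptions. The SimulatedCamera class is a hypothetical stand-in and not a real camera interface; it assumes that the fraction of the image height occupied by the object grows linearly with the zoom factor. The 25% lower bound is taken from the video conferencing example above, while the 10% zoom increment per step is an arbitrary illustrative choice.

```python
class SimulatedCamera:
    """Hypothetical stand-in for a PTZ camera; not a real camera API."""

    def __init__(self, fraction_at_unit_zoom):
        self.zoom_factor = 1.0
        self._base = fraction_at_unit_zoom

    def object_fraction(self):
        # Assumption: the object's share of the image height scales with zoom.
        return min(1.0, self._base * self.zoom_factor)

    def zoom_in(self, ratio=1.1):
        self.zoom_factor *= ratio


def optical_zoom_fixed(camera, target_low=0.25, max_steps=100):
    """Zoom in until the object occupies at least target_low of the image height."""
    steps = 0
    while camera.object_fraction() < target_low and steps < max_steps:
        camera.zoom_in()
        steps += 1
    return camera.object_fraction()


camera = SimulatedCamera(fraction_at_unit_zoom=0.10)
final_fraction = optical_zoom_fixed(camera)
print(round(final_fraction, 3), round(camera.zoom_factor, 3))
```

Because each step enlarges the object by at most 10%, a loop that stops on first reaching 25% cannot overshoot the 35% upper bound of the example range.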

In a dynamic stopping criteria approach, the optical zooming again continues 
until the object of interest reaches a designated percentage of the image, but the percentage in 
this approach is a function of another detected quality associated with the object of interest. 
For example, the percentage may vary as a function of qualities such as level of apparent 
motion, use of particular key words or other audio or speech cues, change in intensity, pitch or 
other voice quality, etc. Again, the specific percentages and the manner in which they vary 
based on the detected qualities will generally depend upon the particular tracking application, 
and can be determined in a straightforward manner by those skilled in the art.
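
One possible form of a dynamic stopping criterion is sketched below in Python. The description does not specify how the percentage depends on the detected quality, so the linear mapping, the function name dynamic_target_fraction, the normalized motion_level input and the 35% and 20% endpoints are all assumptions chosen purely for illustration. The sketch lowers the target percentage as apparent motion increases, leaving more margin around an active object.

```python
def dynamic_target_fraction(motion_level, calm_fraction=0.35, active_fraction=0.20):
    """Map a normalized motion level in [0, 1] to a target image fraction.

    Assumed behavior: a still object is framed tightly (calm_fraction), while a
    strongly moving object is framed more loosely (active_fraction).
    """
    # Clamp so that out-of-range measurements cannot push the target outside
    # the interval between the two endpoints.
    m = min(max(motion_level, 0.0), 1.0)
    return calm_fraction + m * (active_fraction - calm_fraction)


for level in (0.0, 0.5, 1.0):
    print(level, round(dynamic_target_fraction(level), 3))
```

The returned fraction would then take the place of the fixed percentage as the threshold at which optical zooming stops.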

The result of the optical zooming operation 34 is an optically-zoomed image
44, in which the object of interest 22-k is centered within the image and occupies a desired
percentage of the image as determined based on the above-described fixed or dynamic
stopping criteria. The image 44 may be stored by the system 10, e.g., in memory 14.

The virtual zooming operation 36 is then performed on the optically-zoomed
image 44. This virtual zooming operation first extracts a region of interest from the image 44.
For example, in the video conferencing application, a region of interest 47 may be identified
as the head and shoulders of the current object of interest 22-k. In video surveillance
applications, the region of interest may be the hands, feet, head, body or other designated
portion of the object of interest. The identification of the region of interest may be a dynamic
process, e.g., it may be selected by an operator based on the current tracking objectives. The
region of interest may be identified and extracted using known techniques, e.g., the techniques
described in the references cited above in conjunction with detection of the object of interest.
The extracted region of interest is then interpolated using well-known image interpolation
techniques to generate a video output which includes the virtually-zoomed image 46. The
image 46 thus represents a virtual zoom of the optically-zoomed image 44.
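
The extract-then-interpolate structure of the virtual zoom can be sketched in Python as follows. The description refers only to well-known interpolation techniques; bilinear interpolation is used here as one common choice, not as the method of the invention. The function names, the rectangular (left, top, width, height) region format and the representation of a grayscale image as a list of rows are all assumptions made for illustration.

```python
def extract_region(image, left, top, width, height):
    """Crop a rectangular region of interest from a list-of-rows grayscale image."""
    return [row[left:left + width] for row in image[top:top + height]]


def bilinear_resize(region, out_w, out_h):
    """Enlarge a region to out_w x out_h pixels using bilinear interpolation."""
    in_h, in_w = len(region), len(region[0])
    out = []
    for j in range(out_h):
        # Map the output row back to a fractional input row (corners aligned).
        fy = j * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for i in range(out_w):
            fx = i * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Blend the four neighbouring input pixels.
            upper = region[y0][x0] * (1 - wx) + region[y0][x1] * wx
            lower = region[y1][x0] * (1 - wx) + region[y1][x1] * wx
            row.append(upper * (1 - wy) + lower * wy)
        out.append(row)
    return out


def virtual_zoom(image, roi, out_w, out_h):
    """Extract roi = (left, top, width, height) and interpolate it to the output size."""
    left, top, width, height = roi
    return bilinear_resize(extract_region(image, left, top, width, height), out_w, out_h)


# Hypothetical 4x4 test image in which pixel (row r, column c) has value 10*r + c.
image = [[10 * r + c for c in range(4)] for r in range(4)]
zoomed = virtual_zoom(image, roi=(1, 1, 2, 2), out_w=3, out_h=3)
print(zoomed)
```

Because the interpolation only estimates new pixel values from existing ones, the enlarged region carries no more detail than the extracted region, which is why the optical zoom is applied first.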

It should be noted that the virtual zooming operation 36 may be performed in a
different system than the detection and tracking operation 32 and optical zooming operation
34. For example, the image 44 may be compressed and then transmitted from the system 10
via the I/O device 15, with the virtual zooming operation being performed in signal processing
elements of a corresponding receiver.

Advantageously, the hybrid mechanism 30 allows for a certain amount of
movement on the part of the object of interest, while preserving a desired level of image
quality in the video output. For example, if the object of interest 22-k moves, the virtual
zooming operation 36 can be repeated using the same pan, tilt and zoom settings determined in
the optical zooming operation 34. In this case, the extraction and interpolation operations of
the virtual zoom can result in an output image in which the object of interest 22-k remains
substantially centered in the image.

The hybrid mechanism 30 can incorporate multiple thresholds for determining
when the virtual zooming and optical zooming operations should be repeated. For example, if
a given amount of movement of the object of interest exceeds a first threshold, the virtual
zooming operation 36 may be repeated with the pan, tilt and zoom settings of the camera
unchanged. If the given amount of movement exceeds a second, higher threshold, the optical
zooming step 34 may be repeated to determine new pan, tilt and zoom settings, and then the
virtual zooming operation 36 is repeated to obtain the desired output image 46. A feedback
path 48 is included between the optical zooming operation 34 and the detection and tracking
operation 32 such that the detection and tracking operation can be repeated if necessary, e.g.,
in the event that the optical zooming operation detects a substantial movement of the object of
interest such that it can no longer track that object.
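
The two-threshold logic and the feedback path can be summarized as a small decision function, sketched here in Python. The names Action and choose_action, the units of the movement measure and the behavior below the first threshold (leaving the current output unchanged) are assumptions introduced for illustration and are not specified in the description.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep the current output"
    VIRTUAL_ONLY = "repeat virtual zoom with unchanged pan, tilt and zoom"
    OPTICAL_THEN_VIRTUAL = "repeat optical zoom, then virtual zoom"
    REDETECT = "repeat detection and tracking"


def choose_action(movement, first_threshold, second_threshold, track_lost=False):
    """Select which operation to repeat for a measured amount of object movement."""
    if second_threshold <= first_threshold:
        raise ValueError("second_threshold must be higher than first_threshold")
    if track_lost:
        # Corresponds to the feedback path from optical zooming back to detection.
        return Action.REDETECT
    if movement > second_threshold:
        return Action.OPTICAL_THEN_VIRTUAL
    if movement > first_threshold:
        return Action.VIRTUAL_ONLY
    return Action.KEEP


# Hypothetical thresholds, e.g. displacement of the object in pixels.
for movement in (2.0, 8.0, 30.0):
    print(movement, choose_action(movement, first_threshold=5.0, second_threshold=20.0).name)
```

Testing the higher threshold first ensures that a large movement triggers the optical zoom rather than being absorbed by the virtual zoom alone.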

The hybrid optical and virtual zoom mechanism of the invention provides a
number of significant advantages over conventional approaches. As described previously, the
hybrid mechanism accommodates some movement of the object of interest without the need to
determine new optical pan, tilt and zoom settings, while also providing a desired output image
quality level. By preventing the PTZ camera from zooming in too far, the invention ensures
that the PTZ camera settings are adjusted less frequently, and the computational load on the
system processor is thereby reduced relative to that required by a conventional optical zoom
approach. In addition, the hybrid mechanism of the invention can provide an improved
compression rate for image transmission. For example, as noted above, the virtual zoom
operation can be performed after an image is transmitted from the system 10 to a receiver via
the I/O device 15. Consequently, the proportion of the object in the transmitted image is lower
than it would otherwise be using a conventional approach, thereby allowing for less
compression and an improved compression rate.

The above-described embodiment of the invention is intended to be illustrative
only. For example, the invention can be used to implement real-time tracking of any desired
object of interest, and in a wide variety of applications, including video conferencing systems,
video surveillance systems, and other camera-based systems. In addition, although illustrated
using a system with a single PTZ camera, the invention is also applicable to systems with
multiple PTZ cameras, and to systems with other types and arrangements of image capture
devices. Moreover, the invention can utilize many different types of techniques to detect and
track an object of interest, and to extract and interpolate a region of interest. The invention
can also be implemented at least in part in the form of one or more software programs which
are stored on an electronic, magnetic or optical storage medium and executed by a processing
device, e.g., by the processor 12 of system 10. These and numerous other embodiments within
the scope of the following claims will be apparent to those skilled in the art.

CLAIMS:



1. A method for tracking an object of interest (22-k) in a video processing system
(10), the method comprising the steps of:

detecting the object of interest in a first image (40) generated by a camera (18);

performing an optical zooming operation (34) to establish at least a zoom
setting for the camera in accordance with one or more designated stopping criteria based on
the object of interest; and

performing a virtual zooming operation (36) on a second image (44) generated
by the camera at the established setting.

2. An apparatus for tracking an object of interest (22-k) in a video processing
system (10), the apparatus comprising:

a camera (18); and

a processor (12) coupled to the camera and operative to detect the object of
interest in a first image (40) generated by the camera, wherein the processor directs the
performance of (i) an optical zooming operation (34) to establish at least a zoom setting for the
camera in accordance with one or more designated stopping criteria based on the object of
interest, and (ii) a virtual zooming operation (36) on a second image (44) generated by the
camera at the established setting.

3. The apparatus of claim 2 wherein the camera is a pan-tilt-zoom (PTZ) camera
having adjustable pan, tilt and zoom settings.

4. The apparatus of claim 3 wherein the optical zooming operation includes
framing the object of interest by adjusting the pan and tilt settings of the camera, and
performing an optical zoom on the framed object of interest until the designated stopping
criteria is met.

5. The apparatus of claim 2 wherein the designated stopping criteria indicates that
the optical zooming continues until the object of interest occupies a percentage of a resulting
image.

6. The apparatus of claim 5 wherein the percentage is a fixed percentage.

7. The apparatus of claim 5 wherein the percentage varies as a function of a
detected quality associated with the object of interest.

8. The apparatus of claim 7 wherein the detected quality associated with the object
of interest includes at least one of a level of apparent motion, a use of a particular
audibly-detectable cue, and a change in a voice quality.

9. The apparatus of claim 2 wherein the virtual zooming operation includes
identifying a region of interest (47) in the second image, extracting the region of interest, and
interpolating the extracted region of interest to generate a third image (46).

10. The apparatus of claim 3 wherein the processor is further operative to direct a
repeating of the virtual zooming operation on the second image using pan, tilt and zoom
settings established in the optical zooming operation if a level of movement of the object of
interest exceeds a first threshold.

11. The apparatus of claim 10 wherein the processor is further operative to direct a
repeating of the optical zooming operation in order to establish at least one new setting for the
camera if the level of movement of the object of interest exceeds a second threshold higher
than the first threshold.

12. The apparatus of claim 2 wherein the video processing system comprises a
video conferencing system.

13. The apparatus of claim 2 wherein the video processing system comprises a
video surveillance system.

14. An article of manufacture comprising a storage medium (14) for storing one or
more programs which when executed by a processing system (10) implement the steps of:

detecting an object of interest (22-k) in a first image (40) generated by a camera
(18);

performing an optical zooming operation (34) to establish at least a zoom
setting for the camera in accordance with one or more designated stopping criteria based on
the object of interest; and

performing a virtual zooming operation (36) on a second image (44) generated
by the camera at the established setting.